Harmful algal blooms have been a problem in Lake Erie for over a decade. They have reduced fish populations and biodiversity and have harmed communities living near the lake. The neurotoxins and hepatotoxins produced by cyanobacteria are difficult to remove in drinking water treatment plants at high concentrations, which can force plant shutdowns and reduce the potable water supply for the community. To help understand and reduce bluegreen algae blooms, multiple buoys were deployed on the lake to collect water quality data, alongside field sampling and processing by labs throughout the area. Because lab testing involves long wait times and high costs, a deep learning model was created to predict algae concentration from the water quality parameters collected by the buoys. The model predicts bluegreen algal bloom concentration with 87% accuracy, with slight overfitting that causes the predictions to lean towards the more conservative estimate.
In the past decade, Lake Erie has seen high concentrations of cyanobacteria, or bluegreen algae. A severity index was created to rank the algal blooms that occur each year; the highest severities occurred in 2011 and 2015, at 10 and 10.5 respectively. Not all causes of the algal blooms have been determined, but research has identified many of them. These include nutrient-rich water from wastewater treatment plants, farm fields, and fertilized lawns; invasive species; and warm, shallow water in the lake. Furthermore, scientists consider nitrogen in the form of nitrate, together with phosphorus, to be the main culprits in bluegreen algae growth (Dean, 2022).
To reduce the risk of harmful algal blooms, the state of Michigan plans to focus on reducing phosphorus loads from wastewater treatment plants and agricultural sources in the River Raisin and Maumee River watersheds. It also plans to form collaborative partnerships to provide assistance to farmers and promote conservation practices. Local and state efforts currently focus on reducing the growth of harmful algae, but implementing new policy takes time (Dean, 2022).
To assist in research, several buoys were placed in Lake Erie. They measure multiple water quality parameters and report them to research labs throughout the area. Several of these labs also provide field sampling data on physicochemical properties along with bluegreen algae concentrations. Using these data, a predictive model can be trained to predict harmful algal bloom concentrations and determine whether the concentration is harmful to human and environmental health.
Data were pulled into RStudio by reading HTML tables from the ERDDAP scientific database with the rvest package. This database houses water quality data from buoys, field sampling, and laboratory tests. Data were pulled for 2022; because the sources had to be matched in time, only the data from August to November were used.
The water quality parameters were chosen based on availability and significance. Important parameters include chlorophyll mass concentration and fluorescence, dissolved oxygen mass concentration and fractional saturation, and phycocyanin fluorescence. Chlorophyll is used by bluegreen algae to collect photosynthetically active light and therefore may be important in predicting algae concentration (Robert A. Andersen, n.d.). Dissolved oxygen is known to be depleted during periods of high algal bloom growth, which can affect the growth of aquatic plants and animals (Ting-ting Wu, 2015). Finally, phycocyanin is a non-toxic, water-soluble pigment protein from microalgae that exhibits antioxidant, anti-inflammatory, hepatoprotective, and neuroprotective effects (Morais, 2018).
Since the dataset contains only about 1000 observations, and the predictive
model requires large amounts of data to train, imputation was used
rather than removing the columns containing missing data. The Amelia
package was used to impute the missing data. Amelia imputes
data by using the expectation-maximization (EM) algorithm with bootstrapping.
Bootstrapping is a method of inferring results for a population from
results found on a collection of smaller random samples of that
population, using replacement during the sampling process. The EM
algorithm works by computing the expected value of the log-likelihood
function with respect to the conditional distribution of Y given X, using
the parameter estimates of the previous iteration. This expectation step is shown
as:
\[Q(\theta \mid \theta^{(t)}) = E_{Y \mid X,
\theta^{(t)}}\left[ \log L(\theta \mid X, Y) \right]\]
For the maximization step, the expectation is maximized over the parameters, and the
resulting estimate is used again in the expectation step. The maximization step is
shown as:
\[\theta^{(t+1)}=\arg\max_{\theta}Q(\theta \mid \theta^{(t)})\]
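The two steps can be illustrated with a minimal sketch. It is written in Python for illustration only and is not Amelia's implementation: it assumes a bivariate normal model in which `x` is fully observed and some `y` values are missing, and the function name `em_impute_bivariate` and the toy data are hypothetical.

```python
def em_impute_bivariate(x, y, n_iter=200):
    """EM sketch for a bivariate normal; x is complete, missing y are None."""
    n = len(x)
    # x is fully observed, so its mean and variance are fixed.
    mu_x = sum(x) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n

    # Starting values: observed-case mean and variance of y, zero covariance.
    observed = [yi for yi in y if yi is not None]
    mu_y = sum(observed) / len(observed)
    s_yy = sum((yi - mu_y) ** 2 for yi in observed) / len(observed)
    s_xy = 0.0

    for _ in range(n_iter):
        slope = s_xy / s_xx
        resid_var = max(s_yy - slope * s_xy, 0.0)  # Var(y | x)

        # E-step: expected y and y^2 given x and the current parameters.
        ey, ey2 = [], []
        for xi, yi in zip(x, y):
            if yi is None:
                cond_mean = mu_y + slope * (xi - mu_x)
                ey.append(cond_mean)
                ey2.append(cond_mean ** 2 + resid_var)
            else:
                ey.append(yi)
                ey2.append(yi ** 2)

        # M-step: parameters that maximize the expected log-likelihood.
        mu_y = sum(ey) / n
        s_yy = sum(ey2) / n - mu_y ** 2
        s_xy = sum(xi * e for xi, e in zip(x, ey)) / n - mu_x * mu_y

    slope = s_xy / s_xx
    return [yi if yi is not None else mu_y + slope * (xi - mu_x)
            for xi, yi in zip(x, y)]


# Toy data in which the observed pairs follow y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [3.0, 5.0, 7.0, None, None, 13.0, 15.0, 17.0]
imputed = em_impute_bivariate(x, y)
print(imputed)
```

In this toy case the missing values are filled in from the relationship learned from the observed pairs. Amelia additionally bootstraps the data before running EM so that each imputed copy reflects estimation uncertainty.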
Amelia will create copies of the dataset with new imputed values. The number of copies created will depend on the value for “m” entered. Further analysis is done on each of the “m” datasets so that a variance can be calculated. Distributions are then plotted to compare the original data distribution with the imputed distribution for each of the imputed features to validate the imputation.
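As a rough illustration of how results from the “m” datasets can be combined, the sketch below averages one prediction per imputed copy and reports the sample variance across copies. It is written in Python for illustration; the function name `pool_predictions` and the numbers are hypothetical and do not come from the study.

```python
from statistics import mean, variance


def pool_predictions(predictions_by_copy):
    """Combine predictions from m imputed datasets, point by point.

    predictions_by_copy holds m lists, one per imputed dataset, each with
    one prediction per observation. Returns (mean, variance) per observation.
    """
    pooled = []
    for values in zip(*predictions_by_copy):
        pooled.append((mean(values), variance(values)))
    return pooled


# Hypothetical predictions for two observations from m = 5 imputed copies.
predictions = [
    [0.50, 0.70],
    [0.52, 0.68],
    [0.48, 0.72],
    [0.51, 0.69],
    [0.49, 0.71],
]
pooled = pool_predictions(predictions)
for point_mean, point_var in pooled:
    print(round(point_mean, 4), round(point_var, 6))
```

A small variance across copies suggests that the imputed values have little influence on the prediction for that observation.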
A correlation and p-value matrix was generated for the features by using the rcorr() function. The Spearman method was used because it is rank-based and therefore captures monotonic relationships whether they are linear or nonlinear. Pairs plots were created for the features to assess linearity between them. This was done with plotly using the “splom” trace type.
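To make the choice of the Spearman method concrete, the following sketch computes it as the Pearson correlation of average ranks. It is an illustration in Python rather than the rcorr() call used in the analysis, and the helper names and toy data are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# A monotonic but nonlinear relationship (y = x^3).
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
rho = spearman(x, y)
r = pearson(x, y)
print(rho, r)
```

For this monotonic cubic relationship the rank correlation is perfect, while the Pearson coefficient falls below 1 because the relationship is not linear.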
Each feature's importance was calculated using the Boruta package.
Boruta is a feature selection method which uses random forest
classification to determine the importance of each feature. It
first creates “shadow features” by copying and shuffling each
original feature before appending them to the original dataframe. This gives
a dataframe with twice as many columns as the original. Boruta then
builds a random forest classifier on the new feature space and
expresses the importance of each feature as a Z-score.
The algorithm checks whether the original feature has higher
importance than the maximum importance of the shadow features,
\[Z\text{-}Score_{original} > Z\text{-}Score_{Max\, shadow}\]
If the importance is higher, the feature is recorded as
important; otherwise it is recorded as unimportant.
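The shadow-feature comparison can be sketched in a few lines. This is not the Boruta algorithm itself: as a stand-in for random forest importance it uses the absolute Pearson correlation with the target, it runs a single round rather than repeated rounds with a statistical test, and the function names and toy data are hypothetical.

```python
import math
import random


def abs_corr(a, b):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return abs(cov / math.sqrt(var_a * var_b))


def shadow_feature_check(features, target, seed=0):
    """Flag features whose score beats the best shuffled (shadow) copy."""
    rng = random.Random(seed)
    shadow_scores = []
    for values in features.values():
        shadow = list(values)
        rng.shuffle(shadow)  # shuffling breaks any link to the target
        shadow_scores.append(abs_corr(shadow, target))
    threshold = max(shadow_scores)
    return {name: abs_corr(values, target) > threshold
            for name, values in features.items()}


# Toy data: "signal" determines the target, "pattern" is unrelated to it.
signal = [float(i) for i in range(30)]
pattern = [float((i * 7) % 11) for i in range(30)]
target = [2.0 * v + 1.0 for v in signal]
decisions = shadow_feature_check({"signal": signal, "pattern": pattern}, target)
print(decisions)
```

The design point is that a feature is only kept if it outperforms noise built from the same marginal distributions, which gives a data-driven threshold instead of a fixed cutoff.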
To ensure better optimization in the deep learning neural network,
the data were pre-processed by normalization. By normalizing all of the
data to values between 0 and 1, the deep learning network will be less
likely to get trapped in local extrema caused by highly fluctuating
values. Instead, the loss surface should have shallower extrema and the
algorithm should be able to converge more easily. For normalization, the
following function was built,
\[x_{norm}=\frac{x - \min(x)}{\max(x) - \min(x)}\]
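A minimal sketch of this min-max formula is given below, in Python for illustration. The function name `min_max_normalize` is hypothetical, returning zeros for a constant column is an assumption, and the example values are the temperature quantiles (in kelvin) from the summary table.

```python
def min_max_normalize(values):
    """Rescale values to the range [0, 1] using (x - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Minimum, quartiles, and maximum of temperature from the summary table.
temperatures_k = [280.9, 292.0, 296.9, 298.0, 300.2]
normalized = min_max_normalize(temperatures_k)
print([round(v, 3) for v in normalized])
```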
Using the normalized data, training and testing sets were created by taking random samples in a 90:10 split. This higher training share was used because 20% of the training data is held out as a validation set within the neural network. The output variable (bluegreen algae concentration) was left out of the training and testing inputs, and the corresponding train and test outputs were stored separately for use as targets during training and evaluation.
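The split can be sketched as follows, in Python for illustration. The function name `train_test_split`, the seed, and the rounding of the test-set size are assumptions; the row count of 988 is taken from the summary table.

```python
import random


def train_test_split(rows, test_fraction=0.1, seed=42):
    """Randomly split rows into training and testing subsets."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_indices = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_indices]
    test = [row for i, row in enumerate(rows) if i in test_indices]
    return train, test


# Row identifiers stand in for the 988 observations in the dataset.
rows = list(range(988))
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))
```

With 988 rows a 10% test share comes to 99 observations, which agrees with the total number of cases in the confusion matrix reported in the results.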
The deep learning neural network was built using the Keras package for R, which wraps the Python Keras library. The network consisted of an input layer of 10 nodes (layer 1), followed by a hidden layer of 10 nodes (layer 2), a hidden layer of 120 nodes (layer 3), a dropout layer with a 30% rate (layer 4), and finally a layer with a single output node (layer 5). In each dense layer, every neuron computes the weighted sum of its inputs, adds a bias, and passes the result through an activation function that determines the neuron's output. Beginning with layer 2, the activation functions are relu, relu, and linear; the dropout layer has no activation. The relu function, short for “rectified linear unit”, is a piecewise linear function that outputs the input directly if it is positive and zero otherwise. The linear function, also known as “no activation”, simply returns the value it was given. The dropout layer is used to approximate training a large number of neural networks with different architectures in parallel. During training, a number of layer outputs are randomly ignored, which has the effect of making the layer be treated like a layer with a different number of nodes and connectivity to the prior layer. This breaks up situations where network layers co-adapt to correct mistakes from prior layers, in turn making the model more robust.
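The building blocks described above can be sketched in plain Python. This is an illustration of the arithmetic, not the Keras implementation: the helper names, weights, and inputs are hypothetical, and the dropout shown is the usual "inverted" form, which rescales the kept outputs by 1 / (1 - rate) during training.

```python
import random


def relu(value):
    """Rectified linear unit: pass positive values through, else return zero."""
    return max(0.0, value)


def linear(value):
    """Linear ("no") activation: return the input unchanged."""
    return value


def dense(inputs, weights, biases, activation):
    """One dense layer: weighted sum plus bias, then the activation."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def dropout(values, rate, rng):
    """Training-time dropout: zero each value with probability `rate`."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


# Tiny hypothetical network: 2 inputs -> 2 relu units -> 1 linear output.
hidden = dense([1.0, 2.0], [[0.5, 0.25], [-1.0, 0.25]], [0.0, 0.0], relu)
output = dense(hidden, [[2.0, 3.0]], [0.5], linear)
dropped = dropout([1.0, 1.0, 1.0, 1.0], rate=0.3, rng=random.Random(0))
print(hidden, output, dropped)
```

At prediction time dropout is switched off, so every unit contributes to the output.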
The model was compiled using the mean-squared error for both the loss
function as well as the metric. The mean-squared error is given by
\[MSE=\left( \frac{1}{n}
\right)\sum_{i=1}^{n}(Y_i-Y'_i)^2\]
where n is the number of data points, \(Y_i\) is the observed value, and \(Y'_i\) is the predicted value. This
function measures error in statistical models by using the average
squared difference between observed and predicted values which tells how
close the predictions are to the observed values. Furthermore, Adam was
chosen as the optimizer for the model. Adam is an alternative to
classical stochastic gradient descent that combines
properties of the AdaGrad and RMSProp algorithms to create an algorithm
that can handle sparse gradients on noisy data.
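As a small worked example of the mean-squared error formula, the sketch below computes it for a handful of values. It is written in Python for illustration; the function name and the numbers are hypothetical.

```python
def mean_squared_error(observed, predicted):
    """Average of the squared differences between observed and predicted."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical normalized concentrations.
observed = [0.10, 0.25, 0.40, 0.80]
predicted = [0.10, 0.30, 0.35, 0.60]
mse = mean_squared_error(observed, predicted)
print(round(mse, 5))
```

Because the errors are squared, the single large miss (0.80 against 0.60) dominates the result, which is why this loss penalizes large deviations more heavily than small ones.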
After compilation, the model was fitted using the normalized training inputs together with the normalized training outputs (bluegreen algae concentration) as targets. A 20% validation split was taken from this dataset and evaluated at each of the 200 epochs.
Using the normalized testing data created earlier, the model was evaluated, and the predictions were plotted against the observed values to show the linearity. The predicted and observed data were then categorized by the danger level of algae (safe, caution, danger), with cutoffs set at (x<0.6, 0.6<x<1, x>1). These values are based on hazardous levels of the toxins produced by the algae. A confusion matrix was created to determine the accuracy of the classification.
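The binning and confusion-matrix step can be sketched as follows, in Python for illustration. The function names and the example concentrations are hypothetical, the cutoffs are parameters that default to the values stated in this paragraph, and assigning a value that falls exactly on a cutoff to the higher category is an assumption.

```python
from collections import Counter


def danger_level(concentration, caution_at=0.6, danger_at=1.0):
    """Map a concentration to a safety category."""
    if concentration < caution_at:
        return "safe"
    if concentration < danger_at:
        return "caution"
    return "danger"


def confusion_counts(predicted, observed):
    """Count (predicted category, observed category) pairs."""
    return Counter((danger_level(p), danger_level(o))
                   for p, o in zip(predicted, observed))


# Hypothetical predicted and observed concentrations.
predicted = [0.40, 0.70, 1.20, 0.65, 0.30]
observed = [0.50, 0.55, 1.10, 0.90, 0.20]
counts = confusion_counts(predicted, observed)
accuracy = sum(n for (p, o), n in counts.items() if p == o) / len(observed)
print(dict(counts), accuracy)
```

Off-diagonal pairs such as ("caution", "safe") are cases where the prediction was more severe than the observation, which is the conservative direction of error.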
| row | time | longitude | latitude | chlorophyll_fluorescence | fractional_saturation_of_oxygen_in_sea_water | mass_concentration_of_blue_green_algae_in_sea_water | mass_concentration_of_blue_green_algae_in_sea_water_rfu | mass_concentration_of_chlorophyll_in_sea_water | mass_concentration_of_oxygen_in_sea_water | sea_surface_temperature | sea_water_electrical_conductivity | sea_water_ph_reported_on_total_scale | ammonia | phosphate | phycocyanin_fluorescence | nitrate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.0 | Min. :2022-08-04 16:20:00 | Min. :82.94 | Min. :41.51 | Min. :0.1400 | Min. : 9.40 | Min. :0.0000 | Min. :0.220 | Min. : 0.000 | Min. :0.000790 | Min. :280.9 | Min. :0.02449 | Min. :7.350 | Min. : 0.0000 | Min. :0.00000 | Min. : 0.1600 | Min. : 0.000 | |
| 1st Qu.:247.8 | 1st Qu.:2022-08-15 08:25:00 | 1st Qu.:82.94 | 1st Qu.:41.51 | 1st Qu.:0.5800 | 1st Qu.: 75.26 | 1st Qu.:0.3600 | 1st Qu.:0.670 | 1st Qu.: 1.650 | 1st Qu.:0.006287 | 1st Qu.:292.0 | 1st Qu.:0.02650 | 1st Qu.:7.980 | 1st Qu.: 0.0350 | 1st Qu.:0.04800 | 1st Qu.: 0.4600 | 1st Qu.: 2.184 | |
| Median :494.5 | Median :2022-09-04 09:20:00 | Median :82.94 | Median :41.51 | Median :0.8000 | Median : 88.14 | Median :0.5100 | Median :0.820 | Median : 2.745 | Median :0.007465 | Median :296.9 | Median :0.02792 | Median :8.135 | Median : 0.0460 | Median :0.09300 | Median : 0.7500 | Median : 6.195 | |
| Mean :494.5 | Mean :2022-09-07 11:16:14 | Mean :82.94 | Mean :41.51 | Mean :0.9249 | Mean : 79.63 | Mean :0.6323 | Mean :0.949 | Mean : 3.338 | Mean :0.007200 | Mean :294.6 | Mean :0.02789 | Mean :8.129 | Mean : 0.9008 | Mean :0.08647 | Mean : 0.9437 | Mean : 9.018 | |
| 3rd Qu.:741.2 | 3rd Qu.:2022-09-25 03:30:00 | 3rd Qu.:82.94 | 3rd Qu.:41.51 | 3rd Qu.:1.1025 | 3rd Qu.: 93.74 | 3rd Qu.:0.7300 | 3rd Qu.:1.050 | 3rd Qu.: 4.188 | 3rd Qu.:0.008452 | 3rd Qu.:298.0 | 3rd Qu.:0.02881 | 3rd Qu.:8.310 | 3rd Qu.: 0.0630 | 3rd Qu.:0.11500 | 3rd Qu.: 1.2400 | 3rd Qu.: 8.422 | |
| Max. :988.0 | Max. :2022-10-31 05:30:00 | Max. :82.94 | Max. :41.51 | Max. :4.0100 | Max. :113.60 | Max. :5.1800 | Max. :5.610 | Max. :18.350 | Max. :0.011280 | Max. :300.2 | Max. :0.03693 | Max. :8.880 | Max. :89.7500 | Max. :1.96500 | Max. :13.1100 | Max. :128.660 | |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :3 | NA | NA’s :39 | NA’s :4 |
There appears to be missing data in some of the columns. From Figure 1, we can see that there is only a small amount (~1%) of missing data with the majority of it being in the nitrate data. The columns appear to have a mixture of missing at random and missing not at random data. Since the neural network requires large amounts of data to train, imputation was used rather than removing those observations. However, since some of the data is missing not at random a validation step was added to ensure the distribution of the imputed data followed the original data.
The data was imputed with five copies of the data. The variance in the imputed data sets was accounted for after running each copy through the model rather than averaging the data sets together beforehand. Table 2 shows one of the imputed data sets. Note that there are no longer missing values and the mean of the imputed features is still roughly the same (within 5%).
| chlorophyll_flourescence_rfu | oxygen_saturation_fraction | bluegreen_algae_conc_ug.L | bluegreen_algae_conc_rfu | chlorophyll_conc_kg.m3 | oxygen_conc_kg.m3 | temp_K | elec_cond_s.m | pH | ammonia_mg.L | phosphate_mg.L | phycocayanin_flour_rfu | nitrate_mg.L | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. :0.1400 | Min. : 9.40 | Min. :0.003866 | Min. :0.220 | Min. : 0.010 | Min. :0.000790 | Min. :280.9 | Min. :0.02449 | Min. :7.350 | Min. : 0.00100 | Min. :0.00100 | Min. : 0.1600 | Min. : 0.005 | |
| 1st Qu.:0.5800 | 1st Qu.: 75.26 | 1st Qu.:0.360000 | 1st Qu.:0.670 | 1st Qu.: 1.650 | 1st Qu.:0.006287 | 1st Qu.:292.0 | 1st Qu.:0.02650 | 1st Qu.:7.980 | 1st Qu.: 0.03575 | 1st Qu.:0.04900 | 1st Qu.: 0.4600 | 1st Qu.: 2.304 | |
| Median :0.8000 | Median : 88.14 | Median :0.510000 | Median :0.820 | Median : 2.745 | Median :0.007465 | Median :296.9 | Median :0.02792 | Median :8.135 | Median : 0.04700 | Median :0.09400 | Median : 0.7800 | Median : 6.625 | |
| Mean :0.9249 | Mean : 79.63 | Mean :0.632261 | Mean :0.949 | Mean : 3.338 | Mean :0.007200 | Mean :294.6 | Mean :0.02789 | Mean :8.129 | Mean : 0.96328 | Mean :0.08736 | Mean : 0.9737 | Mean : 9.747 | |
| 3rd Qu.:1.1025 | 3rd Qu.: 93.74 | 3rd Qu.:0.730000 | 3rd Qu.:1.050 | 3rd Qu.: 4.188 | 3rd Qu.:0.008452 | 3rd Qu.:298.0 | 3rd Qu.:0.02881 | 3rd Qu.:8.310 | 3rd Qu.: 0.06500 | 3rd Qu.:0.11500 | 3rd Qu.: 1.3125 | 3rd Qu.: 8.810 | |
| Max. :4.0100 | Max. :113.60 | Max. :5.180000 | Max. :5.610 | Max. :18.350 | Max. :0.011280 | Max. :300.2 | Max. :0.03693 | Max. :8.880 | Max. :89.75000 | Max. :1.96500 | Max. :13.1100 | Max. :128.660 |
To validate the imputation, the density of the imputed data sets and the original data were plotted to assess any differences. Figure 2 shows that for each of the features that required imputation the distribution is almost an exact match. This shows that the imputation was successful so the missing not at random data can still be utilized rather than removed.
From the literature, high levels of nitrogen, phosphorus, and oxygen have been found to cause high levels of algal blooms. However, Figure 3 shows that bluegreen algae concentration only has high correlation with chlorophyll (a Spearman coefficient of about 0.58 in the table below). The high correlation with chlorophyll is likely due to its use by bluegreen algae when performing photosynthesis. Because of this process, we would expect high levels of chlorophyll when there are high levels of bluegreen algae. Regarding the low nutrient correlations, one possibility is that algae ingest nitrogen and phosphorus compounds other than phosphate, nitrate, and ammonia. If these compounds must first be reduced to forms that the algae can ingest, then they would not show a correlation with algae concentration in their current form.
| row | column | cor | p |
|---|---|---|---|
| chlorophyll_flourescence_rfu | oxygen_saturation_fraction | 0.0462035 | 0.1467167 |
| chlorophyll_flourescence_rfu | bluegreen_algae_conc_ug.L | 0.5777716 | 0.0000000 |
| oxygen_saturation_fraction | bluegreen_algae_conc_ug.L | -0.1968973 | 0.0000000 |
| chlorophyll_flourescence_rfu | bluegreen_algae_conc_rfu | 0.5778876 | 0.0000000 |
| oxygen_saturation_fraction | bluegreen_algae_conc_rfu | -0.1968976 | 0.0000000 |
| bluegreen_algae_conc_ug.L | bluegreen_algae_conc_rfu | 0.9998813 | 0.0000000 |
| chlorophyll_flourescence_rfu | chlorophyll_conc_kg.m3 | 0.9998631 | 0.0000000 |
| oxygen_saturation_fraction | chlorophyll_conc_kg.m3 | 0.0457179 | 0.1510146 |
| bluegreen_algae_conc_ug.L | chlorophyll_conc_kg.m3 | 0.5776132 | 0.0000000 |
| bluegreen_algae_conc_rfu | chlorophyll_conc_kg.m3 | 0.5777307 | 0.0000000 |
| chlorophyll_flourescence_rfu | oxygen_conc_kg.m3 | -0.0837429 | 0.0084500 |
| oxygen_saturation_fraction | oxygen_conc_kg.m3 | 0.9348801 | 0.0000000 |
| bluegreen_algae_conc_ug.L | oxygen_conc_kg.m3 | -0.2045957 | 0.0000000 |
| bluegreen_algae_conc_rfu | oxygen_conc_kg.m3 | -0.2043150 | 0.0000000 |
| chlorophyll_conc_kg.m3 | oxygen_conc_kg.m3 | -0.0840331 | 0.0082248 |
| chlorophyll_flourescence_rfu | temp_K | 0.2705492 | 0.0000000 |
| oxygen_saturation_fraction | temp_K | -0.5206588 | 0.0000000 |
| bluegreen_algae_conc_ug.L | temp_K | 0.1610705 | 0.0000004 |
| bluegreen_algae_conc_rfu | temp_K | 0.1603892 | 0.0000004 |
| chlorophyll_conc_kg.m3 | temp_K | 0.2702133 | 0.0000000 |
| oxygen_conc_kg.m3 | temp_K | -0.7872609 | 0.0000000 |
| chlorophyll_flourescence_rfu | elec_cond_s.m | 0.1443712 | 0.0000052 |
| oxygen_saturation_fraction | elec_cond_s.m | -0.2905712 | 0.0000000 |
| bluegreen_algae_conc_ug.L | elec_cond_s.m | 0.0118127 | 0.7107536 |
| bluegreen_algae_conc_rfu | elec_cond_s.m | 0.0123224 | 0.6988679 |
| chlorophyll_conc_kg.m3 | elec_cond_s.m | 0.1439419 | 0.0000056 |
| oxygen_conc_kg.m3 | elec_cond_s.m | -0.4765012 | 0.0000000 |
| temp_K | elec_cond_s.m | 0.6460887 | 0.0000000 |
| chlorophyll_flourescence_rfu | pH | 0.2477013 | 0.0000000 |
| oxygen_saturation_fraction | pH | 0.6759774 | 0.0000000 |
| bluegreen_algae_conc_ug.L | pH | -0.0463170 | 0.1457257 |
| bluegreen_algae_conc_rfu | pH | -0.0472350 | 0.1378986 |
| chlorophyll_conc_kg.m3 | pH | 0.2470548 | 0.0000000 |
| oxygen_conc_kg.m3 | pH | 0.4099825 | 0.0000000 |
| temp_K | pH | 0.1994611 | 0.0000000 |
| elec_cond_s.m | pH | 0.0908866 | 0.0042485 |
| chlorophyll_flourescence_rfu | ammonia_mg.L | -0.0184501 | 0.5624230 |
| oxygen_saturation_fraction | ammonia_mg.L | -0.1310455 | 0.0000360 |
| bluegreen_algae_conc_ug.L | ammonia_mg.L | -0.0262731 | 0.4094138 |
| bluegreen_algae_conc_rfu | ammonia_mg.L | -0.0262415 | 0.4099783 |
| chlorophyll_conc_kg.m3 | ammonia_mg.L | -0.0184193 | 0.5630757 |
| oxygen_conc_kg.m3 | ammonia_mg.L | -0.1445060 | 0.0000051 |
| temp_K | ammonia_mg.L | 0.1369932 | 0.0000155 |
| elec_cond_s.m | ammonia_mg.L | 0.0622937 | 0.0502924 |
| pH | ammonia_mg.L | -0.0207024 | 0.5157084 |
| chlorophyll_flourescence_rfu | phosphate_mg.L | -0.0239740 | 0.4516202 |
| oxygen_saturation_fraction | phosphate_mg.L | -0.1299577 | 0.0000418 |
| bluegreen_algae_conc_ug.L | phosphate_mg.L | -0.0744718 | 0.0192259 |
| bluegreen_algae_conc_rfu | phosphate_mg.L | -0.0744626 | 0.0192409 |
| chlorophyll_conc_kg.m3 | phosphate_mg.L | -0.0210640 | 0.5084020 |
| oxygen_conc_kg.m3 | phosphate_mg.L | -0.2033884 | 0.0000000 |
| temp_K | phosphate_mg.L | 0.2509549 | 0.0000000 |
| elec_cond_s.m | phosphate_mg.L | 0.0876747 | 0.0058218 |
| pH | phosphate_mg.L | 0.0599259 | 0.0597102 |
| ammonia_mg.L | phosphate_mg.L | 0.0873662 | 0.0059976 |
| chlorophyll_flourescence_rfu | phycocayanin_flour_rfu | -0.2554151 | 0.0000000 |
| oxygen_saturation_fraction | phycocayanin_flour_rfu | 0.1742336 | 0.0000000 |
| bluegreen_algae_conc_ug.L | phycocayanin_flour_rfu | -0.1589443 | 0.0000005 |
| bluegreen_algae_conc_rfu | phycocayanin_flour_rfu | -0.1584730 | 0.0000006 |
| chlorophyll_conc_kg.m3 | phycocayanin_flour_rfu | -0.2556941 | 0.0000000 |
| oxygen_conc_kg.m3 | phycocayanin_flour_rfu | 0.2654157 | 0.0000000 |
| temp_K | phycocayanin_flour_rfu | -0.3191895 | 0.0000000 |
| elec_cond_s.m | phycocayanin_flour_rfu | -0.2754157 | 0.0000000 |
| pH | phycocayanin_flour_rfu | -0.0516825 | 0.1044748 |
| ammonia_mg.L | phycocayanin_flour_rfu | -0.0551432 | 0.0832003 |
| phosphate_mg.L | phycocayanin_flour_rfu | -0.1002687 | 0.0016014 |
| chlorophyll_flourescence_rfu | nitrate_mg.L | -0.1930433 | 0.0000000 |
| oxygen_saturation_fraction | nitrate_mg.L | 0.2918227 | 0.0000000 |
| bluegreen_algae_conc_ug.L | nitrate_mg.L | -0.1011573 | 0.0014537 |
| bluegreen_algae_conc_rfu | nitrate_mg.L | -0.1006212 | 0.0015412 |
| chlorophyll_conc_kg.m3 | nitrate_mg.L | -0.1926856 | 0.0000000 |
| oxygen_conc_kg.m3 | nitrate_mg.L | 0.4551739 | 0.0000000 |
| temp_K | nitrate_mg.L | -0.5652034 | 0.0000000 |
| elec_cond_s.m | nitrate_mg.L | -0.2478414 | 0.0000000 |
| pH | nitrate_mg.L | -0.1178570 | 0.0002050 |
| ammonia_mg.L | nitrate_mg.L | -0.0768239 | 0.0157232 |
| phosphate_mg.L | nitrate_mg.L | -0.1818843 | 0.0000000 |
| phycocayanin_flour_rfu | nitrate_mg.L | 0.3063759 | 0.0000000 |
The linearity of each feature was assessed in Figure 4. The linearity reflects the correlation plot, showing some linearity between algae concentration and chlorophyll concentration, as well as with parameters such as oxygen saturation and temperature.
Due to the low levels of correlation and linearity, Boruta was used to assess the importance of each feature in the dataset. Figure 5 shows the results of the Boruta feature selection, which found that all of the features in the dataset are important in the prediction of bluegreen algae growth. As found before, chlorophyll is the most important feature in the dataset, with ammonia being the least important. As stated previously, however, this may be due to algae using reduced forms of the nutrient compounds. If nutrient data for various forms of each nutrient were available, the analysis would be more robust.
With the data normalized, the training and testing split was made (90:10) and the model was compiled with the Adam optimizer and mean squared error loss function. The results in Figure 6 show that the model predicts the testing data with about 88% accuracy for each of the imputed datasets. After each imputed dataset had been fitted to a model for predictions, the predicted values were averaged together and the variance was determined.
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense_2 (Dense) (None, 10) 120
## dense_1 (Dense) (None, 120) 1320
## dropout (Dropout) (None, 120) 0
## dense (Dense) (None, 1) 121
## ================================================================================
## Total params: 1,561
## Trainable params: 1,561
## Non-trainable params: 0
## ________________________________________________________________________________
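The parameter counts in this summary can be checked by hand, since a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below, in Python for illustration, does this arithmetic; the figure of 11 input features is an inference from the first count of 120 and is not stated in the summary itself.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return (n_inputs + 1) * n_units


# Assumption: 11 input features, inferred from 120 = (11 + 1) * 10.
n_features = 11
layer_sizes = [10, 120, 1]

counts = []
n_inputs = n_features
for n_units in layer_sizes:
    counts.append(dense_params(n_inputs, n_units))
    n_inputs = n_units  # the next layer takes this layer's outputs

total_params = sum(counts)  # the dropout layer adds no parameters
print(counts, total_params)
```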
Figure 7 shows the averaged predictions along with error bars showing the variance, which is fairly low for each data point. Since the model appears to be overfitting, a confusion matrix was generated to determine whether the predictions follow a conservative pattern. Three bins were created based on the safety level of the algae concentration: “safe” refers to concentrations under 0.6 ppb, “caution” to concentrations between 0.6 and 0.8 ppb, and “danger” to concentrations above 0.8 ppb.
As shown in the confusion matrix output, the model is skewed towards the more conservative estimate, predicting danger or caution when conditions are actually safe. Although this is not ideal, the conservative estimate is favored over the alternative, which would predict safe conditions when they are actually dangerous. Figure 8 shows the distribution of the predicted and observed bluegreen algae concentrations. The predicted results have a higher density around the mean and slightly higher concentration values. Further optimization of the model is needed to reduce the overfitting, although around 85% accuracy is still an acceptable result. Further optimization of the deep learning neural network and more data are needed to reach higher accuracy.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Caution Danger Safe
## Caution 15 5 8
## Danger 1 7 1
## Safe 4 4 54
##
## Overall Statistics
##
## Accuracy : 0.7677
## 95% CI : (0.6721, 0.8467)
## No Information Rate : 0.6364
## P-Value [Acc > NIR] : 0.003609
##
## Kappa : 0.5614
##
## Mcnemar's Test P-Value : 0.121757
##
## Statistics by Class:
##
## Class: Caution Class: Danger Class: Safe
## Sensitivity 0.7500 0.43750 0.8571
## Specificity 0.8354 0.97590 0.7778
## Pos Pred Value 0.5357 0.77778 0.8710
## Neg Pred Value 0.9296 0.90000 0.7568
## Prevalence 0.2020 0.16162 0.6364
## Detection Rate 0.1515 0.07071 0.5455
## Detection Prevalence 0.2828 0.09091 0.6263
## Balanced Accuracy 0.7927 0.70670 0.8175
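The headline statistics in this output can be reproduced directly from the counts in the matrix. The sketch below, in Python for illustration, recomputes the overall accuracy, the per-class sensitivity, and the positive predictive value; the variable names are hypothetical.

```python
labels = ["Caution", "Danger", "Safe"]

# Rows are predictions and columns are reference classes, as in the output.
matrix = [
    [15, 5, 8],   # predicted Caution
    [1, 7, 1],    # predicted Danger
    [4, 4, 54],   # predicted Safe
]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(3)) / total

# Sensitivity: correct predictions divided by the reference (column) total.
sensitivity = {
    labels[j]: matrix[j][j] / sum(matrix[i][j] for i in range(3))
    for j in range(3)
}

# Positive predictive value: correct predictions divided by the row total.
pos_pred_value = {labels[i]: matrix[i][i] / sum(matrix[i]) for i in range(3)}

print(total, round(accuracy, 4))
print({k: round(v, 4) for k, v in sensitivity.items()})
print({k: round(v, 4) for k, v in pos_pred_value.items()})
```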
To combat the growth of harmful bluegreen algal blooms, multiple water quality parameters are being measured in real time, along with field sampling of aqueous nutrients. To further assist in this effort, a model was trained to predict bluegreen algae concentrations using the water quality data available from the buoys as well as the nutrient sampling done in the lab. Prior to model training, the data were imputed with 5 copies to fill in missing values, which is reflected in the variance in the predicted values. Using a correlation matrix, it was found that chlorophyll was highly correlated with bluegreen algae concentration, but no other parameters were closely correlated. However, this could be due to other reactions happening with other parameters not shown. To assess the importance of each feature, Boruta was used to ensure that the features being input into the model would not negatively impact the weights and biases. Boruta determined that all of the features were important in determining bluegreen algae concentration.
The model was created with two hidden layers, one with 10 nodes and one with 120 nodes, and a dropout layer with a rate of 0.3. The Adam optimizer was used along with the mean squared error as the loss function and metric. Relu was used as the activation function in both hidden layers, with a linear activation function for the output. The model was trained using a training dataset created from a 90:10 split, with 20% of it held out for validation. Evaluating the model on the testing data resulted in an 87% correlation between the predicted and observed values. Since the model was overfitting slightly, bins were created for safety levels: less than 0.6 is safe, between 0.6 and 0.8 is caution, and greater than 0.8 is danger. The confusion matrix indicated that the model was in fact overfitting, which caused slightly more conservative estimates; in the case of public safety, this is the preferred direction of error. Further optimization of the deep learning neural network is needed to increase accuracy beyond 90%. Furthermore, more data are needed across larger ranges of values to create a more robust model.
I would like to thank Professor Dinov for his guidance throughout this project. I would also like to thank the labs who allowed their data to be open source and easily available.
Brownlee, J. (2019, Jan 09). A Gentle Introduction to the Rectified Linear Unit (ReLU). Retrieved from Machine Learning Mastery: https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/
Dean, S. (2022, August 11). Harmful algal blooms in Lake Erie expected to be smaller this year, says NOAA. Retrieved from Michigan.gov: https://www.michigan.gov/egle/newsroom/mi-environment/2022/08/11/harmful-algal-blooms-in-lake-erie-expected-to-be-smaller-this-year-says-noaa#:~:text=In%20Lake%20Erie%2C%20several%20factors,aren’t%20quite%20understood%20yet.
Morais, M. G. (2018, Feb 1). Phycocyanin from Microalgae: Properties, Extraction and Purification, with Some Recent Applications. Retrieved from https://www.liebertpub.com/doi/10.1089/ind.2017.0009#:~:text=Phycocyanin%20is%20a%20non%2Dtoxic,%2C%20hepatoprotective%2C%20and%20neuroprotective%20effects.
Andersen, R. A. (n.d.). Photosynthesis and light-absorbing pigments. Retrieved from Britannica: https://www.britannica.com/science/algae/Photosynthesis-and-light-absorbing-pigments
Seagull. (n.d.). Retrieved from Seagull: https://seagull.glos.org/map?coords=-83.4103060%2C41.6266502%2C10&lake=Erie&tags=platforms%3Abuoy%2Cweather%3A%2Cwater%3A%2Cfavorite%3A&platform=RBS-TOL
Ting-ting Wu, G.-f. L.-q.-y. (2015, Jan). Impacts of algal blooms accumulation on physiological ecology of water hyacinth. Retrieved from National Library of Medicine: https://pubmed.ncbi.nlm.nih.gov/25898654/